Abstract

Recently, a successful pose estimation algorithm, called Cascade Pose
Regression (CPR), was proposed in the literature. Trained over Pose Index
Features, CPR is a regressor ensemble that is similar to Boosting. In this
paper we show how CPR can be represented as a Neural Network.
Specifically, we adopt a Graph Transformer Network (GTN) representation
and accordingly train CPR with Back Propagation (BP), which permits
global tuning. In contrast, the previous CPR literature only used
layer-wise training without any subsequent fine-tuning. We empirically
show that global training with BP outperforms layer-wise (pre-)training.
Our CPR-GTN adopts a Multi-Layer Perceptron as the regressor, which uses
sparse connections to learn local image feature representations. We
tested the proposed CPR-GTN on the 2D face pose estimation problem, as in
the previous CPR literature. In addition, we investigated the possibility
of extending CPR-GTN to 3D pose estimation through experiments on a 3D
Computed Tomography dataset for heart segmentation.


1. Introduction 

Recently, an effective technique for object pose estimation, referred to
as Cascade Pose Regression (CPR), was proposed by the computer vision
community (Dollar et al., 2010) and has seen successful applications,
e.g., the long line of work in face pose estimation (a.k.a. face
alignment) (Cao et al., 2014; Burgos-Artizzu et al., 2013; Kazemi &
Josephine, 2014; Ren et al., 2014; Xiong & De la Torre, 2013). In CPR,
the pose is represented as a set of landmarks, and a regressor is learned
that maps an unseen image to the coordinates of all the landmarks.


CPR is essentially a regressor ensemble, summing up a number of stage
regressors which are learned in a stage-wise, greedy forward manner. Each
stage regressor predicts a pose increment. CPR is thus very similar to
Boosting, except that each stage regressor owns a private and unique
feature pool that explicitly depends on the pose predicted up to the
previous stage regressor. This way, CPR can gradually capture complicated
poses and consequently be more robust to pose variations than a plain
regressor.

In this paper, we investigate CPR from another perspective. We show how
to formulate CPR as a Neural Network (NN) with sparse connections that
"encode" our prior knowledge (i.e., the domain knowledge of image data).
Specifically, we propose a Graph Transformer Network (GTN) representation
(LeCun et al., 1998) so that CPR is trained globally with Back
Propagation (BP). In contrast, the conventional CPR methods only adopt
layer-wise training without any subsequent fine-tuning.

In previous work, CPR has been applied exclusively to 2D pose estimation.
In this paper, we investigate the feasibility of solving the 3D pose
estimation problem by applying CPR to 3D Computed Tomography (CT) images
for heart segmentation. See Fig. 1.

1.1. Related Work 

Cascade Pose Regression. Traditional image features, e.g., SIFT (Lowe,
2004) and HoG (Dalal & Triggs, 2005), are blind to pose variation as they
only depend on image pixel values. To alleviate this, (Fleuret & Geman,
2008) proposed the Pose Index Feature (PIF), which depends on both pixel
values and the object pose. A simple yet effective PIF implementation is
to model the pose with a set of landmarks and extract traditional image
features in the neighbourhood of each landmark. However, the coupling of
PIF and pose estimation is a chicken-and-egg dilemma. (Dollar et al.,
2010) thus devised an iterative algorithm that starts from an initial
pose guess and alternates between pose estimation and PIF extraction,
where the random pixel difference is adopted as
Figure 2. The MLP and its GTN representation. Left: conventional MLP graph. Middle: the equivalent GTN. Right: a look into the GTN
module surrounded by the dashed rectangle, i.e., $F_1(\cdot)$.



Figure 1. Examples of the datasets used in this paper. Top: face
pose estimation from 2D images, landmarks in red; Bottom: heart
(Left Ventricle) segmentation from 3D CT images, with three mutually
perpendicular 2D slices shown for each 3D image. Landmarks
are shown as small solid white balls. The triangular connections
among landmarks (red straight lines) are for display purposes only
and are not used in our CPR training. Both the inner and
outer surfaces are shown as meshes. Best viewed on screen.
See the text in Section 4 for more explanations.


PIF. Following this line, much of the later work (Cao et al., 2014;
Burgos-Artizzu et al., 2013; Ren et al., 2014; Kazemi & Josephine, 2014)
showed that CPR can be successfully applied to an important computer
vision problem: face pose estimation (a.k.a. face alignment). Moreover,
the Supervised Descent Method (SDM) (Xiong & De la Torre, 2013),
motivated by non-convex optimization, can also be viewed as CPR where the
PIF is based on the HoG feature and the stage regressor is a linear SVM.

CPR with NN. In the recent literature, NN was also proposed as the stage
regressor for CPR. (Sun et al., 2013) and (Toshev & Szegedy, 2014)
proposed to use a Convolutional Neural Network (CNN) (Le et al., 2012)
for face pose and human pose, respectively. (Zhang et al., 2014b;a)
adopted a Stacked Auto-Encoder.

Layer-Wise (Pre-)Training. All the aforementioned CPR methods, whether
they involve NN or not, train the regressor ensemble in a Boosting
manner. In terms of NN training, the CPR is trained layer-wise, without
any subsequent fine-tuning. In this work, however, CPR is trained
globally with BP. As a result, an immediate question arises: can we take
the layer-wise training as pre-training, in the hope that it improves the
performance of CPR? This issue is somewhat controversial in the NN
literature. (Hinton & Salakhutdinov, 2006; Vincent et al., 2010) reported
that unsupervised layer-wise pre-training can improve the generalization
of NN. (Girshick et al., 2014) showed that supervised pre-training on
another dataset with a similar task can work well and that unsupervised
pre-training seems unnecessary. Moreover, (Ciresan et al., 2012) showed
that no pre-training is needed at all provided the NN is carefully
regularized and sufficiently deep. Given this ambiguous evidence in the
literature, in this work we compared the three types of training (i.e.,
layer-wise training, layer-wise pre-training, pure BP) empirically and
found that pure BP worked best.

Spatially Local Connection for NN. In this work, we take the Multi-Layer
Perceptron (MLP) as the stage regressor, and we adopt spatially local
connections, similar to the setting of CNN (LeCun et al., 1998), in order
to learn local image feature representations. However, we do not use
CNN's weight sharing, a choice inspired by observations in the recent
literature (Le et al., 2012; Taigman et al., 2014). The underlying
concept will be discussed in Section 3.5.

Face Pose Estimation with Auxiliary Information. 

When estimating face pose, auxiliary information can be helpful. Examples
include: the rotation and scaling of the head in the 2D plane (Dantone et
al., 2012; Smith et al., 2014) or in 3D space (Cao et al., 2013), human
identity (Chen et al., 2014; Cao et al., 2013), human gender and age
(Zhang et al., 2014c), etc. The pose regressor training is then
formulated as multi-task learning or a similar framework. These methods
are shown to significantly improve the accuracy of pose estimation.
However, we do not consider them here; in this paper we keep our focus on
the connection between CPR and NN.

CPR for 3D CT Image. Previous CPR work only studied pose estimation for
2D images. In this work, we investigate the feasibility of applying CPR
to 3D pose estimation for 3D CT image segmentation, where previous work
in the medical image processing literature usually adopted an
ASM/AAM-based method (Zheng et al., 2008) or a
segmentation/edge-detection-based method (Ecabert et al., 2008).

1.2. Summary and Outline 

The technical contributions of this paper can be summarized as follows:

• We show how a CPR can be represented by a GTN.
The derivatives of each GTN module, particularly the PIF
extraction module, are provided.

• We show that CPR can be trained with the pure BP algorithm,
which outperforms layer-wise training (the approach adopted by previous
work on CPR) and layer-wise pre-training in our experiments.

• We show that the CPR stage regressor can be a Multi-Layer
Perceptron (MLP), which can benefit from spatially local
connections that learn local image feature representations.

• We show that the CPR-GTN can be extended to the problem
of 3D pose estimation from 3D CT images.

The rest of this paper is organized as follows: in Section 2 
we briefly review GTN. In Section 3 we first review CPR 
and then show how it is formulated as GTN. In Section 4 
we show experiments on a public face pose dataset and on 
a 3D CT image dataset for heart segmentation. 

2. Review of Graph Transformer Network 

The concept of the Graph Transformer Network (GTN) is well described in
Section IV of (LeCun et al., 1998). For completeness, we briefly review
it in this section with slight modifications to the notation.




Figure 3. A general GTN. 

Basically, GTN is a general representation of deep models, which
includes, but is not confined to, the feed-forward neural network with a
bi-directional list structure. Take a simple MLP with one hidden layer as
an example. The traditional graph is usually drawn as on the left of
Fig. 2. However, it can also be given in GTN form as in the middle of
Fig. 2. The module $F_1(\cdot)$ is a wrapper, which itself can be
expanded into the sub-modules $f_1(\cdot)$ and $\sigma_1(\cdot)$ as on
the right of Fig. 2 (module $F_2(\cdot)$ is analogous).

Each module transforms its input variable into its output variable. In
this example, $f_1(\cdot): x \mapsto u_1$ is a linear transformation,
while $\sigma_1(\cdot): u_1 \mapsto x_1$ performs a point-wise activation
(e.g., sigmoid or tanh). $F_1(\cdot)$ is the composition of $f_1(\cdot)$
and $\sigma_1(\cdot)$ ($F_2(\cdot)$ is analogous). Finally,
$\mathcal{L}(\cdot): (x_2, y) \mapsto L$ calculates the loss, where $y$
is the label associated with the instance and the loss $L$ is a
non-negative scalar. Note that the module $\mathcal{L}(\cdot)$ is only
used in training and is unnecessary in testing. To this extent, the
feed-forward or back-propagation procedure for a GTN can be seen as
message passing over the variables $x, u_1, x_1, \ldots, L$ in either
forward or backward order. Therefore, the training and testing of a GTN
are exactly the same as for the "traditional" graph drawn on the left of
Fig. 2.

In the general case, however, a GTN can be a Directed Acyclic Graph
(DAG), permitting the representation of more complicated deep models.
Interestingly, the feed-forward or back-propagation procedure still works
for a DAG if we restrict the variable order to be consistent with the
parent-child relationship. Consider Fig. 3 as an example. In the
feed-forward pass, the transformation $F_1(\cdot)$ must come before
$F_2(\cdot)$, while the transformations $F_2(\cdot)$ and $F_3(\cdot)$ can
run in parallel. A similar requirement applies to back propagation.

3. The Proposed Method 

In this section we show how to represent CPR by GTN so 
that CPR can be trained with Back Propagation. We begin 
with a brief review of CPR. 

3.1. Review of Cascade Pose Regression 

We demonstrate how CPR works with the example of face pose estimation.
A particular face pose is modeled as a set of $L \in \mathbb{N}^+$
landmark points located at the eye corner, mouth
Figure 4. The CPR as a GTN.

corner, chin, etc. Denote the pose by $p \in \mathbb{R}^{2L}$, which is
a vector concatenating the x-y coordinates of the $L$ points.
Typically, $L = 68$. For a given image $I$ with the face fitting the
image size, we train a regressor $R(\cdot)$ that maps $I$ to a vector in
$\mathbb{R}^{2L}$ to predict the face pose: $p = R(I)$.

Given the complicated pose variations, including facial expressions,
geometrical transformations, lighting, etc., learning the regressor
$R(\cdot)$ directly is not easy. (Dollar et al., 2010) proposed to
accomplish this in a cascaded manner so that the pose is predicted in a
gradually refined way. Specifically, the method, called CPR, starts with
an initial pose guess $p^0$. Inevitably, there is a difference,
hereinafter called the residual, $\Delta p = p^* - p^0$ between $p^0$ and
the ground truth pose $p^*$. Then, a regressor $R^1(I, p^0)$, taking as
input both the image $I$ and a pose $p^0$, is fitted by least squares to
predict the residual $\Delta p$. After this is done, the guess is
hopefully updated to a more accurate prediction
$p^1 \leftarrow p^0 + R^1(I, p^0)$, while the residual is reduced to
$\Delta p \leftarrow p^* - p^1$. Subsequently, based on $p^1$ we train
$R^2(I, p^1)$, and so on. We end up with a combination of
$T \in \mathbb{N}^+$ regressors:

$$R(I) = p^0 + \sum_{t=1}^{T} R^t(I, p^{t-1}), \qquad (1)$$

where the pose is updated incrementally

$$p^t = p^{t-1} + R^t(I, p^{t-1}) \qquad (2)$$

for stage $t = 1, 2, \ldots, T$. In equation (1) or (2),
$R^t(I, p^{t-1})$ is referred to as the stage regressor, which is learned
at stage $t$ based on the pose estimate $p^{t-1}$ from the previous stage.
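
As a minimal sketch of how equations (1) and (2) are evaluated at test time, the loop below accumulates the pose increments of the stage regressors. The function name `cascade_predict` and the toy stages are illustrative assumptions, not part of the original method; any callable taking `(image, pose)` can stand in for $R^t$.

```python
import numpy as np

def cascade_predict(image, p0, stage_regressors):
    """Apply equation (2) for t = 1..T: p^t = p^{t-1} + R^t(I, p^{t-1})."""
    p = np.asarray(p0, dtype=float).copy()
    for stage in stage_regressors:
        p = p + stage(image, p)  # each stage predicts a pose increment
    return p

# Toy stages (assumption): each one moves the pose halfway to a fixed target.
target = np.array([4.0, 2.0])
toy_stages = [lambda image, p: 0.5 * (target - p)] * 3
p_final = cascade_predict(None, np.zeros(2), toy_stages)
```

With these toy stages the remaining distance to the target halves at every stage, which mirrors the gradually refined prediction described above.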

3.1.1. Pose Index Feature

The stage regressor $R^t(I, p^{t-1})$ works over the so-called Pose Index
Feature (PIF). Different from a conventional image feature, which depends
only on the image $I$, the PIF depends on both the image $I$ and the pose
estimate $p^{t-1}$ from the previous stage. A simple yet effective
implementation is to first extract conventional features in a small
neighborhood of each pose landmark and then concatenate/pool them. This
way, the feature is hopefully invariant to pose variations and thus more
reliable. (Cao et al., 2014; Ren et al., 2014; Burgos-Artizzu et al.,
2013; Kazemi & Josephine, 2014) utilized the Random Pixel Difference as
the underlying conventional feature, while (Xiong & De la Torre, 2013)
resorted to



Figure 5. The stage regressor $R^t(\cdot)$.

HoG. In this work we simply adopt the Random Pixel Difference feature,
which means taking many random point pairs and concatenating the
differences of their pixel values.
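
The random pixel difference PIF can be sketched as follows. This is only an illustration under stated assumptions: the pose vector is taken to be interleaved as $(x_1, y_1, x_2, y_2, \ldots)$, offsets are integer pixel displacements drawn uniformly in a square window, sampled points are rounded and clamped to the image border, and the function names are hypothetical.

```python
import numpy as np

def sample_offsets(num_landmarks, pairs_per_landmark, radius, rng):
    """Draw random integer offset pairs (d1, d2) for every landmark."""
    shape = (num_landmarks, pairs_per_landmark, 2, 2)  # (..., point in pair, xy)
    return rng.integers(-radius, radius + 1, size=shape)

def pixel_difference_features(image, pose, offsets):
    """Concatenate I(p + d1) - I(p + d2) over all landmarks and point pairs."""
    h, w = image.shape
    landmarks = np.asarray(pose, dtype=float).reshape(-1, 2)  # rows are (x, y)
    pts = np.rint(landmarks[:, None, None, :] + offsets).astype(int)
    xs = np.clip(pts[..., 0], 0, w - 1)  # clamp to the image border
    ys = np.clip(pts[..., 1], 0, h - 1)
    values = image[ys, xs].astype(float)               # shape (L, M, 2)
    return (values[..., 0] - values[..., 1]).ravel()   # length L * M
```

Because the sampled points move with the landmarks, the same offsets yield different features for different pose estimates, which is what makes the feature pose-indexed.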

3.1.2. Layer-Wise Learning 

(Dollar et al., 2010) proposed to learn CPR in a forward, greedy
stage-wise way that is similar to Boosting (Friedman et al., 2000). For a
regressor consisting of $T$ stage regressors, it iterates exactly $T$
times and learns just one stage regressor per iteration, based on the
regressors learned in previous iterations. Specifically, at iteration
$1 \le t \le T$, only the stage regressor $R^t(I, p^{t-1})$ is learned
over the features extracted based on
$p^{t-1} = p^0 + \sum_{i=1}^{t-1} R^i(I, p^{i-1})$, which implicitly
depends on all previously learned regressors. This is almost
Boosting-style learning, except that the feature pool changes from one
iteration to another due to the introduction of PIF. This type of
learning for CPR is followed by other authors (Cao et al., 2014; Ren et
al., 2014; Burgos-Artizzu et al., 2013; Kazemi & Josephine, 2014). The
SDM (Xiong & De la Torre, 2013), although motivated by a pure
optimization issue rather than by CPR, also learns in such a layer-wise
manner.
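
The greedy stage-wise procedure can be illustrated with a toy example. Everything specific here is an assumption made for the sketch: the "image" is replaced by its hidden true pose, `toy_pif` is a hypothetical pose-indexed feature, and each stage regressor is a linear least-squares fit rather than the regressors used in the cited work.

```python
import numpy as np

def toy_pif(target_pose, pose):
    """Hypothetical stand-in for a PIF: depends on the image (represented
    here by its hidden true pose) and on the current pose estimate."""
    return np.tanh(target_pose - pose)

def train_layerwise(true_poses, p0, num_stages):
    """Greedy fitting: stage t is fitted on features indexed by p^{t-1}."""
    n = true_poses.shape[0]
    poses = np.tile(p0, (n, 1)).astype(float)
    stages, errors = [], [np.mean((true_poses - poses) ** 2)]
    for _ in range(num_stages):
        feats = toy_pif(true_poses, poses)
        design = np.hstack([feats, np.ones((n, 1))])   # add a bias column
        residual = true_poses - poses                   # regression target
        weights, *_ = np.linalg.lstsq(design, residual, rcond=None)
        poses = poses + design @ weights                # update as in equation (2)
        stages.append(weights)                          # earlier stages stay frozen
        errors.append(np.mean((true_poses - poses) ** 2))
    return stages, errors

rng = np.random.default_rng(0)
true_poses = rng.normal(scale=3.0, size=(200, 2))
stages, errors = train_layerwise(true_poses, np.zeros(2), num_stages=4)
```

Each stage is fitted once and never revisited, so the training error cannot increase from stage to stage, but no stage is ever adjusted in light of the stages that follow it.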

3.2. CPR as GTN 

In this subsection we show how the CPR of the last subsection can be
formulated as a GTN. Recalling equations (1) and (2) and using the
graphic notation reviewed in Section 2, we can express the regressor (1)
by the GTN shown in Fig. 4.

We now explain each variable, each module and the associated
transformation in Fig. 4. The loss module $\mathcal{L}(\cdot)$ calculates
the least-squares error between the pose estimate $p^T$ at the last stage
and the ground truth pose $p^*$. The module "mtx" is a multiplexer that
replicates its input at each of its outputs. The module $R^t(\cdot)$ is
the stage regressor in (1), taking as input both the image $I$ and the
pose estimate $p^{t-1}$ and outputting the pose estimate $p^t$. Looking
inside, $R^t(\cdot)$ can be decomposed into the sub-modules shown in
Fig. 5.

With a slight abuse of notation, we have renamed the input and output
variables in Fig. 5 for simplicity and to avoid name conflicts. The "mtx"
is again a multiplexer. The module "+" takes the sum of the input pose
and the predicted residual: $p' = p + q$. The module $\phi(\cdot)$ is a
PIF extractor that takes as input the current pose estimate $p$ as well
as the raw image $I$. The output feature $x$ is fed into a regressor
$f(\cdot)$ to predict the pose residual $q$. In the following we give
more details on $\phi(\cdot)$ and $f(\cdot)$.

As described in Section 3.1.1, the PIF $x = \phi(I, p)$ can be simply
built on a conventional image feature by restricting the feature
extraction to the neighborhood of each landmark and then concatenating
the results. In this study we follow the line of (Cao et al., 2014; Ren
et al., 2014; Burgos-Artizzu et al., 2013; Kazemi & Josephine, 2014) and
adopt the random pixel difference features. As for the regressor
$f(\cdot)$, we let it be an MLP with only one hidden layer. However, we
only allow spatially local connections for the hidden layer instead of
the full connections of a conventional MLP, which will be discussed in
greater detail in Section 3.5.

3.3. Training with Back Propagation 

As pointed out in (LeCun et al., 1998), a GTN can be trained with the
classical BP algorithm if each module is differentiable almost everywhere
w.r.t. its input variables and parameters (if any). This is the case for
the CPR-GTN proposed in this paper, as explained in the following.

Given the repeated structure of CPR-GTN in Fig. 4, let us just check how
BP works for $R^t(\cdot)$ in Fig. 5. $f(\cdot)$ is an MLP, whose BP is
standard textbook material, e.g., (Bishop et al., 1995), even in the case
of sparse connections (see Section 3.5). The BP for module "mtx" takes,
at the input, the sum of the delta-signals at each output, while the BP
for module "+" replicates, at each input, the delta-signal at the output.
These symmetrical rules are described in (LeCun et al., 1998). The only
thing that needs a little more explanation is the BP for the PIF module
$\phi(\cdot)$, which is given below.
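
The two symmetric rules for "mtx" and "+" stated above are simple enough to write down directly. The following is a minimal sketch with hypothetical function names; it assumes each delta-signal is a NumPy array of the same shape as the corresponding variable.

```python
import numpy as np

def mtx_forward(x, num_outputs):
    """Multiplexer: replicate the input at each output."""
    return [x.copy() for _ in range(num_outputs)]

def mtx_backward(output_deltas):
    """Multiplexer BP: the input delta is the sum of the output deltas."""
    return np.sum(output_deltas, axis=0)

def add_forward(p, q):
    """Adder: p' = p + q."""
    return p + q

def add_backward(output_delta):
    """Adder BP: each input receives a copy of the output delta."""
    return output_delta.copy(), output_delta.copy()
```

The forward rule of one module is the backward rule of the other, which is the symmetry referred to in the text.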

Keeping Fig. 4 in mind, let us start from the simplest case. Let there be
only one landmark such that $p \in \mathbb{R}^2$, with the two entries
corresponding to the horizontal and vertical coordinates, respectively.
Let the two randomly picked points deviate from $p$ by
$d_1 \in \mathbb{R}^2$ and $d_2 \in \mathbb{R}^2$, respectively. With a
slight abuse of notation, for image $I$ we let the scalar random pixel
difference feature $x$ be

$$x = \phi(p, I) = I(p + d_1) - I(p + d_2), \qquad (3)$$

where $I(p)$ denotes the image pixel value at point $p$. From (3) we have
the derivative w.r.t. $p$:

$$\frac{\partial x}{\partial p} = \nabla I(p + d_1) - \nabla I(p + d_2), \qquad (4)$$

where $\nabla I(p) \in \mathbb{R}^2$ denotes the 2D image gradient vector
at point $p$. Likewise, we have the derivative w.r.t. $I$:

$$\frac{\partial x}{\partial I} = \delta(p + d_1) - \delta(p + d_2), \qquad (5)$$

where $\delta(p)$ denotes the pulse response at point $p$, i.e., it
takes value 1 at $p$ and value 0 at every other position. Note that
(5) is a matrix with the size of image $I$.

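Now, in Fig. 4, denote by $L$ the loss and by $\partial L / \partial x$
the scalar delta-signal propagated into the output. According to the
chain rule, the delta-signals propagated to the input variables $p$ and
$I$ are

$$\frac{\partial L}{\partial p} = \frac{\partial x}{\partial p} \frac{\partial L}{\partial x}, \qquad (6)$$

$$\frac{\partial L}{\partial I} = \frac{\partial x}{\partial I} \frac{\partial L}{\partial x}, \qquad (7)$$

respectively, where (4) and (5) should be substituted for concrete
calculations. Note that $\partial L / \partial p \in \mathbb{R}^2$ and
$\partial L / \partial I$ has the same size as image $I$.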

In the general case, there are $L > 1$ landmarks. Moreover, in the
neighborhood of each landmark there can be $M > 1$ random point pairs.
Therefore, equations (4), (5), (6) and (7) should take the appropriate
matrix form. The mathematical details are similar and are omitted here.

Remark 1. The smoothness of the gradient. For a discrete digital image,
the pixel value $I(p)$ is not strictly differentiable w.r.t. the position
$p$. Therefore, $\nabla I(\cdot)$ in equation (4) can only be calculated
approximately. A common technique in image processing is the 2D
convolution of image $I$ with a carefully designed template that
simultaneously mimics the gradient and enforces smoothness. In this paper
we simply adopt the Sobel template (Gonzalez et al., 2004).
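
The forward and backward passes of the PIF module for a single landmark and a single point pair, i.e., equations (3) to (7) with the Sobel approximation of Remark 1, can be sketched as below. The conventions are assumptions made for this illustration: $p = (x, y)$ indexes (column, row), sampled points are rounded to the nearest pixel and assumed to lie inside the image, and the Sobel response is scaled by 1/8 so that it approximates a derivative per pixel.

```python
import numpy as np

def sobel_gradient(img):
    """Sobel estimates of dI/dx (along columns) and dI/dy (along rows)."""
    pad = np.pad(img.astype(float), 1, mode="edge")
    gx = (pad[:-2, 2:] + 2 * pad[1:-1, 2:] + pad[2:, 2:]
          - pad[:-2, :-2] - 2 * pad[1:-1, :-2] - pad[2:, :-2]) / 8.0
    gy = (pad[2:, :-2] + 2 * pad[2:, 1:-1] + pad[2:, 2:]
          - pad[:-2, :-2] - 2 * pad[:-2, 1:-1] - pad[:-2, 2:]) / 8.0
    return gx, gy

def _pixel(p, d):
    """Round p + d to integer (column, row) indices."""
    col, row = np.rint(np.asarray(p) + np.asarray(d)).astype(int)
    return row, col

def pif_forward(img, p, d1, d2):
    """Equation (3): x = I(p + d1) - I(p + d2)."""
    return float(img[_pixel(p, d1)] - img[_pixel(p, d2)])

def pif_backward(img, p, d1, d2, dL_dx):
    """Equations (4)-(7): delta-signals for the pose p and the image I."""
    gx, gy = sobel_gradient(img)
    a, b = _pixel(p, d1), _pixel(p, d2)
    dx_dp = np.array([gx[a] - gx[b], gy[a] - gy[b]])   # equation (4)
    dx_dI = np.zeros(img.shape, dtype=float)           # equation (5)
    dx_dI[a] += 1.0
    dx_dI[b] -= 1.0
    return dx_dp * dL_dx, dx_dI * dL_dx                # equations (6), (7)
```

On a quadratic test image $I(x, y) = x^2 + y^2$ the scaled Sobel response equals the exact gradient $(2x, 2y)$ away from the border, which gives a convenient sanity check for the implementation.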

Remark 2. $\phi(p, I)$ being vector features. In the case of one
landmark, the random pixel difference feature $x$ in (3) is the
subtraction of raw image pixel values and is thus a scalar. However, the
feature can also be a vector $x \in \mathbb{R}^D$ with
$D \in \mathbb{N}^+$. To calculate equation (4) in the vector case, we
can simply view $x$ as an image with $D$ channels and calculate the 2D
gradient independently for each channel. For instance, suppose we are
building $x = \phi(I, p) \in \mathbb{R}^{128}$ on top of HoG-like
features (Dalal & Triggs, 2005; Xiong & De la Torre, 2013), which extract
the gradient histogram for 8 orientations over each cell of a
$4 \times 4$ grid ($D = 128 = 8 \times 4 \times 4$) on a rectangular
image patch centered at some landmark $p$. To obtain
$\partial x / \partial p$, we can first calculate dense HoG at every
point $p$ in the image and then calculate the 2D Sobel gradient over each
of the 128 channels independently. We end up by fetching the results at
the point $p$ of interest and concatenating the 128 2D gradient vectors.

Remark 3. The derivative w.r.t. $I$ for visualization. At first
glance, the derivatives (5) and (7) seem useless. For a given
training image $I$, it does not make sense to change its
pixel values during training, so the delta-signal for $I$ can
Figure 6. Spatially local connection. Left: landmarks (red crosses),
features (blue points) and locality (circle). Right: the corresponding
MLP. See text for explanations.

be ignored. However, (5) and (7) are indeed useful for other
purposes, e.g., the visualization discussed later in this section.

Finally, the feed-forward procedure for all the modules in Fig. 4 is
straightforward. We now have all the details needed to train CPR-GTN with
BP. The parameters can be updated using Stochastic Gradient Descent
(SGD), as discussed in Section 3.6.

3.4. Pure BP and Layer-Wise (Pre-)Training 

In this subsection we discuss CPR training further. In the last
subsection we showed that it can be trained with pure BP. Meanwhile, in
the previous computer vision literature (Dollar et al., 2010;
Burgos-Artizzu et al., 2013; Cao et al., 2014; Kazemi & Josephine, 2014),
CPR is learned in a layer-wise way in which one layer is added at a time.

On the other hand, in the recent NN literature it is believed that
unsupervised (Hinton & Salakhutdinov, 2006; Vincent et al., 2010) or
supervised (Girshick et al., 2014) pre-training is a critical technique
that improves the performance of deep NN structures. For example, a
Restricted Boltzmann Machine or a Stacked Auto-Encoder can be learned in
a layer-wise manner. Then the structure is fixed and the parameters are
preserved, serving as the initialization of a further "fine-tuning" using
BP. This strategy is believed to be better than pure BP with random
parameter initialization. However, there is also later work (Ciresan et
al., 2012) showing that pre-training is not needed at all provided the
deep structure is regularized carefully and trained sufficiently. It is
still unclear whether pre-training should be used or not; a convincing
theoretical explanation seems absent from the literature.

Given the short discussion above, for the CPR-GTN in this paper we face
three types of training: 1. Pure BP with random initialization; 2.
Layer-wise training as in the aforementioned computer vision literature;
3. Layer-wise pre-training that is in between 1 and 2, i.e., first
perform layer-wise training, then take the result as initialization and
continue training with BP. Accordingly, we compare the three types of
training empirically and find that the



Figure 7. The synthesized image for a ground truth pose. Left:
the average image as initialization; Middle: the most likely image;
Right: the ground truth pose with its face image.


pure BP performs best, see Section 4.1. 

3.5. Spatially Local Connection 

In Fig. 5, the module $f(\cdot): x \mapsto q$ can be an arbitrary
regressor provided it is differentiable w.r.t. its input and parameters,
which permits BP¹. An immediate choice seems to be a fully connected MLP.
In this paper we adopt a single-hidden-layer MLP such that
$a = \sigma(W_1 x + b_1)$ and $q = W_2 a + b_2$, where the hidden
variable $a$ can be viewed as an intermediate feature representation with
a higher level of abstraction than the feature $x$. However, it is well
known that a fully connected MLP tends to over-fit due to its large
representational capacity. To alleviate this, we restrict the connections
between $x$ and $a$ to be spatially local.

We explain the spatially local connection by the example on the left of
Fig. 6. Suppose there are $L$ landmarks (red crosses) and $M$ features
per landmark (blue points, each indicating the middle point of a random
point pair) such that in total we have $x \in \mathbb{R}^{N_1}$ where
$N_1 = ML$. Now look at the hidden layer $a \in \mathbb{R}^{N_2}$ for
some $N_2 \in \mathbb{N}^+$. We represent the first component of $a$ by
the centroid of a circle and let it connect only to those blue points
inside the circle. The other $N_2 - 1$ components of $a$ are treated
similarly. The up to $N_2$ circles should be scattered reasonably and
their union should approximately cover all the $N_1$ blue points. The
corresponding MLP is drawn on the right of Fig. 6, where we resort to the
conventional graphic notation of feed-forward networks to emphasize the
sparse connections intended for local feature representation. In this
paper we simply let $q$ be fully connected to $a$, although local
connection also seems to make sense there.
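
One way to realize the circle-based locality described above is a binary mask on the first-layer weights. The sketch below is an illustration under assumptions: hidden units are given explicit centroid coordinates, all circles share one radius, `tanh` is used as the activation, and the function names are hypothetical.

```python
import numpy as np

def local_mask(feature_xy, center_xy, radius):
    """mask[j, i] is True when feature i lies inside the circle of hidden unit j."""
    diff = center_xy[:, None, :] - feature_xy[None, :, :]
    return np.linalg.norm(diff, axis=-1) <= radius

def sparse_mlp_forward(x, w1, b1, w2, b2, mask):
    """a = tanh((W1 * mask) x + b1), q = W2 a + b2."""
    a = np.tanh((w1 * mask) @ x + b1)   # locally connected hidden layer
    return w2 @ a + b2                  # fully connected output layer
```

Multiplying the weights by the mask in the forward pass also zeroes the gradient of the masked entries, so standard BP applies unchanged to the sparse layer.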

Remark. Our choice for $a$ is similar to the "feature map" produced by a
convolutional layer in a CNN. However, there are noticeable differences:

• First, the convolution in a CNN applies directly to every
raw image pixel, while our $a$ takes as input the random
pixel difference features. Such high-level features are spatially
scattered by construction, so they are very likely
¹Therefore, we have to exclude tree regressors, which are popular
in Boosting and Random Forests but are not differentiable.
Figure 8. Various types of training: epochs vs. testing error.

to be mutually uncorrelated and complementary. In contrast, a CNN needs
an additional sub-sampling (or pooling) layer to eliminate the
correlations in the feature map; otherwise it would be sensitive to the
noise of neighboring features and hence lose the property of
translation-invariant feature representation.

• Second, the convolution in a CNN actually adopts both spatially local
connection and weight sharing. In this paper, however, weight sharing is
explicitly dismissed when constructing $a$. This technique was proposed
in the previous literature (Taigman et al., 2014; Le et al., 2012) and
the considerations are explained therein. We rephrase them here. Consider
the task of face recognition (Taigman et al., 2014). The features with a
high response for the eyes should intuitively be different from those for
the mouth, so it suffices to let the filter kernel move within a small
image patch and thus yield a correspondingly small feature map (or even
to fix it on a single pixel, yielding a feature map with only one element
(Le et al., 2012), which is exactly our choice in this paper). This
enforces the specificity of the feature representation, i.e., it helps
produce features that are specialized for the pupil, eyebrow, tooth, etc.

3.6. Implementation Details of Training 

We discuss the geometrical transform of the pose, the data augmentation
trick and the parameter tuning for SGD. See the supplement for details.

3.7. Visualization: Synthesize Image from Pose 

As a bonus of the GTN representation, the trained CPR is rather easy to
visualize (Simonyan et al., 2013), that is, for a given ground truth pose
$p^*$, what is the most likely image $I$ fed to the regressor $R(\cdot)$
so that $p^* = R(I)$? An example of the synthesized image for a given
pose is shown in Fig. 7. Details are given in the supplement.


Figure 9. Various local connections: epochs vs. testing error.

4. Experiments 

We ran experiments on a public 2D face pose dataset called 300-W and on a
3D pose estimation dataset of CT images for heart segmentation, as
described in the following.

300-W (testing on the "common subset"). The source is plain 2D images of
human faces, where the face pose is modeled by a set of $L = 68$
landmarks (Sagonas et al., 2013). The 300-W dataset consists of several
subsets (LFPW, Helen, AFW, etc.) which are publicly available at
http://ibug.doc.ic.ac.uk/resources/facial-point-annotations/. In this
paper, we used the full training set containing 3148 images, while we
tested on the so-called "common subset" defined in (Ren et al., 2014),
which includes 689 images. Examples are shown in Fig. 1.

CT Images. The source is 3D CT images, each of size
$X \times Y \times H$ where $X = Y = 512$ and $H$ ranges from 150 to 210.
The X-Y plane resolution ranges from 0.33 to 0.43 millimeters per pixel,
while the Z axis resolution ranges from 0.60 to 0.62 millimeters per
pixel. We are interested in segmenting the left ventricle (LV) of the
heart². The LV is a bowl-like cavity, so we need to delineate both the
inner surface and the outer surface of the LV, as in Fig. 1. We approach
this by pose estimation based on supervised learning. Specifically, we
adopt $L = 172$ landmarks (86 for the inner surface and 86 for the outer
surface) to fully capture the pose of an LV in a CT image. We randomly
assign 112 CT images for training and 27 for testing, with the landmarks
of each image labeled by a human expert.

4.1. Pure BP vs layer-wise (Pre-)Training 

In this subsection we compare three types of training: BP, layer-wise
training and layer-wise pre-training, whose mutual relation has been
discussed in Section 3.4. We intend to answer the following question:
which type of training is most efficient? Or equivalently, how fast does
each type of training decrease the testing error?

²A heart has four chambers; the LV is the chamber of most interest to
clinicians.

Table 1. Testing error on 300-W.

Method       LBF     LBF fast    SDM     This Paper
Error (%)    4.95    5.38        5.60    6.76

We set up a CPR with $T = 8$ stages (layers). We ran the three types of
training on the 300-W dataset and fixed the total number of SGD
iterations to 60 epochs. We distributed these 60 epochs between
layer-wise pre-training and the subsequent BP in different ways to
determine the effect on the decrease of the testing error. Results are
reported in Fig. 8, in whose legend the format "m-n" means m epochs of
layer-wise pre-training followed by n epochs of BP. Therefore, 60-0 means
only layer-wise training, with each layer trained by 60 SGD epochs; 0-60
means only 60 epochs of BP with random initialization; 15-45, 30-30 and
45-15 are the in-between layer-wise pre-training settings. Note that in
Fig. 8 the layer-wise pre-training phase is simply plotted as a flat
curve, which should not lead to any confusion.

The testing error reported in Fig. 8 is the so-called pupil-distance
normalized mean distance (in percent), which is a widely adopted measure
in the face pose estimation literature (Ren et al., 2014) (Xiong & De la
Torre, 2013) (Dantone et al., 2012) (Burgos-Artizzu et al., 2013). The
same measure is reported for all our experiments on face pose.
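
For concreteness, the pupil-distance normalized mean distance for a single face can be computed as in the sketch below. The function name and the way the pupil centers are supplied are assumptions made for this illustration; the paper does not specify how the pupil positions are derived from the 68 landmarks.

```python
import numpy as np

def normalized_mean_error(pred, truth, left_pupil, right_pupil):
    """Mean landmark distance divided by the pupil distance, in percent."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 2)
    truth = np.asarray(truth, dtype=float).reshape(-1, 2)
    mean_dist = np.linalg.norm(pred - truth, axis=1).mean()
    pupil_dist = np.linalg.norm(
        np.asarray(left_pupil, dtype=float) - np.asarray(right_pupil, dtype=float))
    return 100.0 * mean_dist / pupil_dist
```

For example, if every landmark is off by 5 pixels and the pupils are 50 pixels apart, the error is 10 percent. A dataset-level figure would average this quantity over all test images.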

As we can see, the curve for BP lies below the others almost everywhere,
which suggests that BP decreases the testing error the fastest. The
testing error decreases more and more slowly as more epochs of layer-wise
pre-training are used. Finally, pure layer-wise training performs the
worst.

4.2. The Spatially Local Connection 

As discussed in Section 3.5, the spatially local connection
is beneficial. In this subsection, we investigate its impact
by varying the locality of the connections as follows.

We let the regressor $f(\cdot)$ be a one-hidden-layer MLP. For
a pose with $L$ landmarks, we extract $M$ features around
each landmark, constituting $L \times M$ features: $x \in \mathbb{R}^{LM}$.
For the intermediate feature map representation $a$ (i.e., the
hidden layer of the MLP), we let its number of features (i.e.,
the number of neurons) be $LH$, i.e., $a \in \mathbb{R}^{LH}$, with $H < M$
to learn a compact feature representation. With this setting,
the components of $a$ can be exactly divided into $L$ groups,
each group corresponding to one landmark. For a feature
in $a$ associated with landmark $i$, we let it connect to those
features in $x$ that come from the $k$ nearest-neighbor
landmarks of landmark $i$, i.e., each feature in $a$ connects
to $kM$ features in $x$. In our experiments,
$k \in \{1, 3, 5, 12, 68\}$, as in the legend of Fig. 9, where
we report the testing error on the LFPW dataset (a subset of
300-W, see the beginning of this section), which includes
811 images for training and 224 for testing.

Table 2. Mesh distance (in millimeters) for heart (Left Ventricle)
segmentation.

                 AAM/ASM    This Paper    Do Nothing
inner surface    1.13       3.35          7.24
outer surface    1.21       3.41          7.31

Note that knn-68 is in effect the full connection. In addition,
following the observation in (Ren et al., 2014), we tried
a shrinking setting (“knn-var” in the legend): knn-5 for
the first third of the stages, knn-3 for the second third and
knn-1 for the last third. This coarse-to-fine scheme aims to
capture large displacements in early stages and to make
small refinements in late stages. The specific settings in our
experiments are L = 68, M = 15, H = 5, and the number of
stages T = 24.
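The local connection pattern and the shrinking schedule can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes nearest neighbors are found on a reference pose such as the mean pose, that each landmark counts as its own nearest neighbor, and that input and hidden features are laid out in landmark-major order. The function names are hypothetical.

```python
import numpy as np

def knn_connection_mask(ref_pose, k, M, H):
    """Boolean mask of shape (L*H, L*M): True where a hidden unit may connect.

    ref_pose: (L, 2) landmark coordinates used to find nearest neighbors
              (using the mean pose here is an assumption).
    Hidden units i*H .. (i+1)*H-1 belong to landmark i; input features
    j*M .. (j+1)*M-1 belong to landmark j.
    """
    ref_pose = np.asarray(ref_pose, dtype=float)
    L = ref_pose.shape[0]
    # Pairwise distances between landmarks, shape (L, L).
    dist = np.linalg.norm(ref_pose[:, None, :] - ref_pose[None, :, :], axis=2)
    # Indices of the k nearest landmarks for each landmark (self included).
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    landmark_mask = np.zeros((L, L), dtype=bool)
    landmark_mask[np.arange(L)[:, None], nearest] = True
    # Expand every landmark-level entry into an H x M block.
    return np.repeat(np.repeat(landmark_mask, H, axis=0), M, axis=1)

def knn_var_schedule(stage, num_stages, ks=(5, 3, 1)):
    """k for a 0-based stage index: ks[0] for the first third, and so on."""
    return ks[stage * len(ks) // num_stages]
```

In an MLP stage, the mask would be multiplied elementwise with the first-layer weight matrix, so that each hidden feature only sees the $kM$ inputs of its neighboring landmarks; with k equal to L the mask is all True, which recovers the full connection.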

As can be seen, a smaller k in knn tends to give better
results. However, the shrinking knn works best. This is
consistent with the observations of (Ren et al., 2014), where
the random pixel difference features are extracted with a
shrinking circle radius determined by cross-validation.

4.3. Results on Face Pose Data 

We report the testing error on the face pose dataset 300-W
(testing on the common subset) in Tab. 1, where the
parameters for our algorithm are M = 15, T = 24, and
shrinking knn as in Section 4.2. We also report the
state-of-the-art results quoted from (Ren et al., 2014), where
LBF and LBF fast were proposed in (Ren et al., 2014) and SDM
was proposed in (Xiong & De la Torre, 2013). Our results
seem inferior to those of the competitors. However, we
conjecture that this is because we use a very simple regressor
(i.e., a one-hidden-layer MLP) at each stage, while
(Ren et al., 2014) utilized a tree ensemble with carefully
tuned tree depth and leaf values, which seems to generalize
better than a one-hidden-layer MLP. We also conjecture that
a stronger PIF, e.g., one that uses HoG as in
(Xiong & De la Torre, 2013), would improve our results.

It will be our future work to test the stage regressor
$f(\cdot)$ with a deeper structure. Meanwhile, we anticipate
that the proposed method would show its power with bigger
training data, as has been observed for other deep neural
network based methods; this will also be investigated in
future work.

4.4. Results on 3D CT Images 

As discussed at the beginning of this section, heart
segmentation is converted to a 3D pose estimation problem,
which proceeds similarly to 2D face pose estimation,
except that for the testing error we adopt another measure
that is more popular in the medical image processing
community (Zheng et al., 2008). Specifically, we calculate the
so-called mesh distance between the predicted pose and the
ground truth pose. Roughly speaking, the mesh distance
indicates the closeness of two surfaces, each of which is
composed of triangles.
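To give a feel for such a surface-closeness measure, the sketch below computes a simplified, vertex-only approximation: the symmetric mean nearest-vertex distance between two meshes. This is an assumption made for illustration; it ignores the triangle faces and is not the exact definition used in (Zheng et al., 2008). The function name is hypothetical.

```python
import numpy as np

def approx_mesh_distance(verts_a, verts_b):
    """Symmetric mean nearest-vertex distance between two vertex sets.

    verts_a: (Na, 3) and verts_b: (Nb, 3) arrays of mesh vertices.
    Simplification: only vertices are used, not point-to-triangle distances.
    """
    a = np.asarray(verts_a, dtype=float)
    b = np.asarray(verts_b, dtype=float)
    # All pairwise vertex distances, shape (Na, Nb).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Average of the two directed mean nearest-neighbor distances.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

If the vertex coordinates are given in millimeters, the result is in millimeters as well.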

Following the convention of the medical imaging literature, we
report the mean mesh distance in millimeters over the testing
set for both the inner surface and the outer surface in
Tab. 2, where the baseline is “do nothing”, i.e., the distance
between the initial guess and the ground truth. We also quote
a state-of-the-art result for heart segmentation based on
AAM/ASM (Zheng et al., 2008). Note that the results there
were obtained on their own proprietary datasets with
different scan parameters and image resolutions. The quoted
results serve as a reference and should not be directly
compared with ours.

5. Conclusions and Future Work 

In this paper we propose a GTN representation for CPR
so that it can be trained globally with BP, which
outperforms layer-wise (pre-)training. Sparse connections are
applied in order to learn local image feature representations.
The proposed CPR-GTN is promising for 2D and 3D pose
estimation. We will explore the potential of CPR-GTN by
testing it on much bigger data in the future.
1. Implementation Details of Training

When extracting the random pixel difference features, the
random points $z_1, \ldots, z_i, \ldots$ are generated in a
coordinate system defined by the mean face pose $\bar{p}$, where
each point $z_i$ is anchored to its nearest landmark. For each
training image $I$, we first determine the similarity transform
$A$ from the mean pose $\bar{p}$ to the currently estimated pose
$p$: $p = A(\bar{p})$; then we transform these points accordingly:
$A(z_1), \ldots, A(z_i), \ldots$, which are the very points at
which we extract features for image $I$. This operation makes
the features more robust to rotation and scaling of the face
pose. See (Cao et al., 2014; Kazemi & Josephine, 2014) for
details.
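The feature extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity transform is fitted in the least-squares sense, points are stored as (x, y), pixels are sampled by nearest-neighbor rounding, and the point pairs to be differenced are supplied by the caller. All function names are hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src to dst.

    src, dst: (L, 2) arrays. Returns (S, t) with dst ~= src @ S.T + t,
    where S = [[a, -b], [b, a]] encodes scale and rotation.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    denom = (src_c ** 2).sum()
    a = (src_c * dst_c).sum() / denom
    b = (src_c[:, 0] * dst_c[:, 1] - src_c[:, 1] * dst_c[:, 0]).sum() / denom
    S = np.array([[a, -b], [b, a]])
    t = dst.mean(axis=0) - src.mean(axis=0) @ S.T
    return S, t

def pixel_difference_features(image, points, pairs):
    """Intensity differences between pairs of sampled points.

    image: (rows, cols) grayscale array; points: (N, 2) as (x, y);
    pairs: (F, 2) integer indices into points (the pairing is an assumption).
    """
    points = np.asarray(points, dtype=float)
    pairs = np.asarray(pairs, dtype=int)
    n_rows, n_cols = image.shape
    # Round to the nearest pixel and clip to stay inside the image.
    cols = np.clip(np.rint(points[:, 0]).astype(int), 0, n_cols - 1)
    rows = np.clip(np.rint(points[:, 1]).astype(int), 0, n_rows - 1)
    intensity = image[rows, cols].astype(float)
    return intensity[pairs[:, 0]] - intensity[pairs[:, 1]]
```

A typical use would fit `S, t = fit_similarity(mean_pose, current_pose)`, map the sampling points with `z @ S.T + t`, and pass the mapped points to `pixel_difference_features`.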

In line with (Dollar et al., 2010; Burgos-Artizzu et al., 2013;
Cao et al., 2014; Kazemi & Josephine, 2014), we apply a
data augmentation trick, which has been shown to improve
CPR’s performance.

We follow the suggestions in (Srivastava et al., 2014) for
the NN regularization. Specifically, the Rectified Linear Unit
(ReLU) is taken as the activation function and Dropout with
rate 0.5 is adopted. Stochastic Gradient Descent (SGD) is used
with mini-batch size 50. Parameters are updated with step
size 0.1 and momentum 0.9. Other tuning parameters
involving the CPR are discussed in the experiment section of
the main paper.
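The ingredients listed above can be sketched in a few lines. This is only an illustration with the stated hyperparameters as defaults; the classical momentum form and the inverted-dropout scaling are assumptions, since the exact variants are not specified here. The function names are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit activation."""
    return np.maximum(x, 0.0)

def dropout(x, rate=0.5, rng=None, train=True):
    """Inverted dropout (an assumed variant): zero units with probability
    `rate` during training and rescale the survivors by 1 / (1 - rate)."""
    if not train:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def sgd_momentum_step(param, velocity, grad, step_size=0.1, momentum=0.9):
    """One classical-momentum SGD update (assumed form); `grad` would be the
    gradient averaged over a mini batch of 50 samples."""
    velocity = momentum * velocity - step_size * grad
    return param + velocity, velocity
```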

2. Visualization: Synthesize Image from Pose 

As a bonus of the GTN representation, the trained CPR is
rather easy to visualize (Simonyan et al., 2013): for
a given ground truth pose $p^*$, what is the most likely image
$I$ fed to the regressor $R(\cdot)$ so that $p^* = R(\cdot)$?
Details are put in the supplement.

A trained CPR can be seen as a regressor mapping an image to
a pose: $p = R(I)$. Now consider the inverse problem: given
a ground truth pose $p^*$, what is the most likely image $I$
that the regressor $R(\cdot)$ is fed? An interesting by-product
of CPR-GTN is that it makes such a visualization problem
rather easy. Specifically, we want

$$\arg\min_{I} L = \| p^* - R(I) \|^2. \qquad (1)$$

Since we have formulated CPR as an NN permitting BP, the
above optimization problem can be solved by the technique
introduced in (Simonyan et al., 2013). Basically, we can
perform gradient descent with BP. To do this, we simply let
the delta-signal pass all the way down to the variable $I$ and
update $I$ with $-\eta \frac{\partial L}{\partial I}$, where $\eta$
is the step size. The initial value of $I$ is set to the average
face image over the training set.
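The update rule can be illustrated with a toy example. The sketch below replaces the trained CPR by a hypothetical linear regressor $R(I) = WI + b$ on a flattened image, an assumption made only so that the gradient $\partial L / \partial I$ can be written in closed form; with the real CPR-GTN this gradient would be obtained by BP.

```python
import numpy as np

def synthesize_image(W, b, target_pose, init_image, step_size=0.05, num_steps=500):
    """Gradient descent on the input I of a toy regressor R(I) = W @ I + b,
    minimizing L = ||target_pose - R(I)||^2.

    W: (P, D) weights, b: (P,) bias, target_pose: (P,), init_image: (D,)
    (for example the average face image, flattened).
    """
    image = np.array(init_image, dtype=float)
    for _ in range(num_steps):
        residual = W @ image + b - target_pose   # R(I) - p*
        grad = 2.0 * W.T @ residual              # dL/dI for the linear toy model
        image -= step_size * grad                # I <- I - eta * dL/dI
    return image
```

The step size must be small enough for the descent to be stable; the default here is only suitable for small, well-scaled toy problems.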
Updated Contents


1. Related Work 

One of the ICML 2015 reviewers pointed out that the idea of
globally tuning CPR was already introduced in (Shi et al.,
2014), which we were unaware of when we did this work
from December 2014 to February 2015. The only difference
from (Shi et al., 2014) is that we adopt Random Pixel
Difference features and local connections, while (Shi et al.,
2014) adopts HoG features and full connections. Besides,
(Shi et al., 2014) tried to get rid of the initial pose guess by
a global regression but saw a degraded performance.

2. Updated Experimental Results 

After the submission of the preliminary manuscript to
ICML 2015, we continued our work and obtained some new
results on the 300-W dataset, as reported in Table 1. The
improvement over our previous result is due to a more
aggressive data augmentation (up to ×200 expansion of the
original training dataset, compared to ×20 in our initial work)
and a technique of summing up many mid-way losses as
introduced in (Lee et al., 2014).

In Table 1 we can see that Deep Reg (Shi et al., 2014) is
still better than this work, although they are based on
almost the same framework. Our hunch for this difference is
that HoG is used in Deep Reg, which could be more robust
than the Random Pixel Difference features used in this work.

Note that every stage in CPR outputs a pose prediction. In
Figure 1 we show an example of these mid-way predictions
with and without the technique in (Lee et al., 2014). We
can see that the mid-way predictions are more interpretable
after introducing the sum of the mid-way losses. Although
the mid-way predictions do not directly relate to the final
prediction at the last stage, enforcing the mid-way results to
be closer to the ground truth during training seems to be a
good regularization. See (Lee et al., 2014) for more details.
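The idea of summing mid-way losses can be sketched as below. This is a minimal illustration assuming a squared Euclidean loss at every stage and optional per-stage weights; the actual loss form and weighting used in our experiments are not specified here, and the function name is hypothetical.

```python
import numpy as np

def summed_midway_loss(stage_poses, target_pose, weights=None):
    """Sum of per-stage squared errors between mid-way predictions and truth.

    stage_poses: (T, L, 2) pose predicted after each of the T stages.
    target_pose: (L, 2) ground truth pose.
    weights: optional (T,) per-stage weights (equal weights are an assumption).
    Returns (total_loss, per_stage_losses).
    """
    stage_poses = np.asarray(stage_poses, dtype=float)
    target_pose = np.asarray(target_pose, dtype=float)
    per_stage = ((stage_poses - target_pose) ** 2).sum(axis=(1, 2))
    if weights is None:
        weights = np.ones(len(per_stage))
    return float(np.dot(weights, per_stage)), per_stage
```

Using only the last entry of `per_stage` corresponds to the setting with a single loss at the last stage.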

3. Code 

Our initial code was based on Matlab CPU computation,
which was very slow and hindered parameter tuning. We thus
rewrote the code with GPU acceleration. The new code is
available at https://github.com/pengsun/bpcpr5. The whole
Directed Acyclic Graph (DAG) “framework” is built on top of
a GPU accelerated Matlab toolbox for Convolutional Networks.
Additionally, we implement a GPU accelerated transformer
for Random Pixel Difference feature extraction. The speed
of our code is approximately 150 images per second with
T = 24 stages on a typical contemporary GPU.

Figure 1. Intermediate results for tunable CPR. Red points:
ground truth. Blue points: prediction. Column 1 to column 3:
T = 0 (initial guess), T = 4 and T = 24. Top row: with sum
of mid-way losses. Bottom row: only one loss transformer
at the last stage.

Table 1. Average distance on the testing set of 300-W
(normalized by pupil distance). The results for ESR, SDM, LBF
and LBF fast are quoted from (Ren et al., 2014), and Deep Reg
from (Shi et al., 2014).

Method             Fullset    Common-set    Challenging-set
ESR                7.58       5.28          17.00
SDM                7.52       5.60          15.40
LBF                6.32       4.95          11.98
LBF fast           7.37       5.38          15.50
Deep Reg           6.31       4.51          13.80
This work (new)    7.46       5.24          16.56
This work (old)    NA         6.76          NA